ABSTRACT


Digital resource objects (DROs) are considered one of the most useful resources for storing humanity's collected knowledge. Many organizations now aim to make this data available to individuals. The query that a non-expert user submits to a DRO collection, however, is usually a brief and frequently ambiguous expression of the user's need, and it rarely states explicitly what the user requires. Queries are short because users usually have limited knowledge of the terminology of the specific domain. Informative terms can therefore be missing from the user's query, leading to poor coverage of relevant documents. To bridge the gap between the user's query and DROs, a semantic query expansion (SQE) method is proposed that improves DRO retrieval by raising the quality of the candidate terms that are added semantically to the query. The proposed SQE method comprises three steps, namely query terms definition, candidate terms generation and the proposed correlation algorithm. The aim of the correlation algorithm is to extract semantic terms so that the query is extended with related terms only. Experimental results on the CHiC2013 and ECHiC2013_EDE collections show that the proposed method can significantly outperform previous methods, specifically in DROs.

1. INTRODUCTION


A DRO refers to structured information that describes a knowledge resource and simplifies its retrieval, consumption, and administration. Aside from content storage, DROs provide platforms for searching, retrieving, and organizing data from databases. Standardized resource descriptions aid in the search and retrieval of digital information resources by characterizing single files, single objects, or entire groups [1]. DRO content can be shared, integrated, and aggregated online, and digital file content can be quickly updated. These capabilities help users of digitized content by improving access to digital libraries and allowing their content to be reused for research, learning, and producing new commercial content. In the field of DRO content, it is necessary to address the challenge of the imprecise and succinct queries often posed by non-expert users. For example, users might enter queries such as "ancient vase" or "famous painting" when searching for information about artifacts and artwork. Similarly, queries about historical events may include terms such as "World War II" or "American Revolution", which indicates a tendency for users to search for summary results on important events. Cultural practices, such as "traditional wedding customs" or the "Japanese tea ceremony", can also prompt vague queries. When exploring monuments and landmarks, users may use queries such as "famous monuments in France" or "history of the pyramids", reflecting their preference for quick access to relevant information. In addition, queries such as "Greek Myths" or "Shakespeare's Plays" illustrate users' tendency to seek general information about literary works and folklore. Taken together, these query examples give a comprehensive picture of the challenges posed by non-expert users in the area of cultural heritage content. This, in turn, lays the groundwork for a solution that can focus on query optimization, context-based suggestions, or improved query interpretation to improve user experiences in accessing and understanding cultural heritage information.


In general, a user's query is one of several alternative formulations of an information need that the user may have. As a result, user queries frequently do not accurately reflect the vocabulary of a document that meets the information need. Even if a document has the exact information that a user requires, it cannot be retrieved if it lacks the keywords used in the user's query. Query expansion (QE) is a method that attempts to improve poor user queries by eliminating terms that decrease retrieval efficiency [2], adding terms that aid retrieval [3], re-weighting old or new query terms to change the focus of the query [4], or employing a combination of those methods [5]. Since DROs suffer from the short query problem, this paper focuses on the QE method that is concerned with adding terms to the user's query and re-weighting them. Basically, QE methods use several sources to provide the candidate terms that will be added to the user's query. These fall into two main kinds: the first is an external source such as Wikipedia or WordNet [6], and the second is the set of results obtained from the first-round retrieval [7]. During a typical retrieval session, a user will repeatedly reformulate a query until it meets the information need. The initial attempt to formulate a query with a particular information need in mind is frequently inaccurate and can result in an answer set that fails to satisfy the user's information need [8]. The query may have been too broad, resulting in a wide range of documents in the answer set, so that important materials get mixed in with documents that are irrelevant to the topic at hand. Alternatively, the query could have been overly specific, returning only a portion of the relevant items. A third possibility is that the user's query is entirely defective, with no overlap between the expected set of documents and the actual result set.


Query expansion (QE) is a method used to increase the effectiveness of information retrieval (IR) by reformulating the user's original query [9]. Basically, it rests on the assumption that the query written by the user usually retrieves results that are mostly irrelevant [2]. QE has been widely studied as an effective way to solve the short query problem [10]. To bridge the gap between the user's query and DROs, various QE methods for DRO retrieval have been proposed [11]. Because the effectiveness of QE approaches depends primarily on selecting suitable candidate terms that are semantically connected to the query terms, many recent studies have used semantic terms in the QE method [12]. However, to the best of the researchers' knowledge, only a few studies in DROs have turned to semantic terms, such as [13], which showed that semantic similarity for query expansion can help to enhance the efficiency of DRO retrieval. Similarly, the proposed semantic query expansion (SQE) approach aims to increase DRO retrieval performance by enhancing the quality of the candidate keywords to be appended semantically to the full set of query terms. The semantic concepts are derived by combining the proposed correlation algorithm, which is based on some simple, sensible Boolean heuristics, with Wikipedia as an external resource. The proposed SQE method has the same steps as the traditional QE method. Nevertheless, it differs from previous work in two important points. First, the source of candidate terms is the top k documents, with terms extracted specifically from the content of the metadata units rather than from the metadata title. Second, an extra step is involved, the correlation algorithm, which extracts semantic terms in the context of the query. DROs may struggle to capture the complexity and dynamic nature of user intent and needs. The limitations of relying solely on DROs are directly related to the challenges faced by non-expert users, who often have difficulty expressing precise queries, lack familiarity with specialized terms, and experience evolving information needs.


Semantic query expansion entails understanding the contextual nuances and underlying meanings behind the user's query in addition to finding exact matches to search terms. By adding a deeper knowledge of language semantics, user intent, and the complicated links between words, this technology goes beyond the limitations of standard keyword-based retrieval. This strategy tries to bridge the gap between what users express in their queries and the rich content available in digital resource repositories by employing modern natural language processing techniques, machine learning, and semantic analysis [14].


In this paper, we examine the complexities of semantic query expansion and its implications for improving digital resource object retrieval. We look at how modern technologies are enabling the development of smarter, more intuitive search systems capable of comprehending the nuances of human language. We also consider the obstacles and opportunities that arise when adopting such methods, such as data privacy concerns, algorithmic complexity, and the ethical use of AI in information retrieval [15].


2. RELATED WORK 


Numerous studies have shown that a substantial portion of online searches involve non-expert users generating imprecise or brief queries. For instance, [16] reported that the average query length is between two and three words. Mismatched query and document vocabulary, as well as brief queries, can have a significant impact on the process of locating suitable documents. Moreover, [17], [18], and [19] highlighted that users often face challenges when formulating queries that accurately convey their information needs, leading to less effective search outcomes. Several studies have reported progress on query expansion (QE) methods with some modifications, such as the application of semantic analysis, which enhances the efficiency of document retrieval over the conventional QE method. Based on the similarity thesaurus approach, [20] proposed a probabilistic semantic QE method that is constructed automatically. A similarity thesaurus depends on domain knowledge about the particular collection from which it was built. The terms most similar to the query concept as a whole are used to expand the query, instead of terms that are similar to the individual query terms. [21] proposed a semantic QE algorithm to enhance document search in massive repositories, in which the algorithm restructures the query via WordNet. [22] proposed a semantic QE technique that employs terms reflecting similar expressions or similar semantics, using Wikipedia and WordNet3 as external sources. An Indian patent database (in Excel format) was analyzed, and the similarity of terms was determined via cosine similarity and extended Jaccard coefficients. Expanded queries outperformed non-expanded queries, and the cosine method outperformed the Jaccard coefficient. [6] presented a novel method to extract more term correlations via a Markov network for QE. The extracted term correlations, which are derived from Wikipedia, are merged into a basic Markov network pre-built from a single local corpus.

Because users lack knowledge of the content and terminology of a DRO collection, retrievals return either too many or too few results. Some studies have tackled the issue of DRO collection accessibility using the traditional QE approach to improve the retrieval of collection content. [23] developed a semantic QE method with three versions for picking keywords from Wikipedia that incorporates extra similarity metrics and integrates both external and internal evidence for QE. The investigation included two collections, CHiC-2012 and CHiC-2013, and demonstrated that the proposed strategy improves retrieval performance. [24] described various retrieval approaches to test the relative quality of different QE and semantic enrichment procedures. A variety of strategies based on blind QE were used, with Wikipedia serving as the external resource. Different methods were employed to choose the most suitable concepts to be incorporated into the initial query [25]. The experiments were conducted on a cultural heritage (CH) content test collection. The experimental outcomes show that expansion combining pseudo-relevant documents and external resources enhances retrieval performance, with the influence of external resource expansion being more substantial. To tackle the term-mismatch problem, [26] proposed a semantic model based on semantic associations between indexing terms. The model modifies documents based on a user query and some knowledge of semantic term relationships. It extends the document by incorporating query terms that are not present in the document but are semantically related to at least one document keyword. It then combines the modified document with two language model (LM) smoothing methods, Dirichlet and Jelinek-Mercer (JM). The test was conducted using a variety of CLEF corpora from the medical domain. The experimental results show a significant improvement in retrieval performance over standard LMs, and the model appears to be superior to translation models. [4] presented a semantic enrichment QE approach that incorporates Wikipedia term links into the LM. It handles brief queries that cannot express a precise information need. This method seeks the optimal terms for a query in order to semantically enrich the topic and infer the user's information needs or query intent [14]. The experiment was carried out on a CH content test collection. The experimental results show minimal change when the Porter stemming approach is applied to term links, since the difference between the two outcomes is extremely minor; however, the outputs of the semantic enrichment approach suggest that using out-links is more effective than using a mix of in-links and out-links.


3. PROPOSED SEMANTIC QUERY 
EXPANSION METHOD 


The flowchart of the proposed SQE method is presented in Figure 1, and its pseudocode is shown in Algorithm 1. When queries become longer, there is a greater chance that some significant terms will appear in both the query and the corresponding documents [27]. Furthermore, user queries frequently contain ambiguous phrases. As a result, using the initially entered user query to obtain relevant documents is nearly impossible [28]. The proposed SQE method comprises three steps, namely query terms definition, candidate terms generation (including forming the top N results), and the correlation algorithm. The aim of the correlation algorithm is to extract semantic terms so that the query is extended with related terms only. Each step is described in the subsections that follow.


The semantic query expansion method for DRO retrieval is an information retrieval approach that uses semantic analysis to improve search precision. It broadens user queries by employing pretrained algorithms to identify semantically relevant terms. The approach discovers contextually similar terms by averaging the semantic representations of the query terms. These terms are added to the original query, resulting in a set of expanded queries. When applied to a set of digital resources, the expanded queries return relevant results that contain either the original or the expanded terms. As a result, the approach provides a more comprehensive and context-aware set of results, improving the search experience for digital resource items.


3.1 Query Terms Definition 


The first step of the SQE method is to get the definition of each term in the user's query. To accomplish this, a list of 571 stopwords is first used to remove all stopwords, such as “a”, “about”, “the” and “her”, from the query. Identifying the meaning of every term in the user's query is an important step for semantic retrieval. To find the definition of each query term, Wikipedia is used as an external resource that provides articles related to each keyword in the user's query. After the terms are obtained from the user's query, each term is sent separately to Wikipedia to retrieve its related articles. Thus, each term has its corresponding articles.


Let $Q = \{t_1, t_2, \ldots, t_n\}$, where $Q$ is the query, $t_i$ is a query term and $n$ is the number of query terms. To get the definition $D$ for each $t \in Q$, each $t$ is sent to Wikipedia to retrieve related articles. Then, the first paragraph of each of the top three retrieved articles is taken as a definition $D$. This follows the work of [29], in which the first three articles are employed to determine the range of candidate terms.


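The definitions are collected in a set $C$ whose elements are the definition triples of the query terms:

$$C = \{(D_{1t_1}, D_{2t_1}, D_{3t_1}), (D_{1t_2}, D_{2t_2}, D_{3t_2}), \ldots, (D_{1t_n}, D_{2t_n}, D_{3t_n})\}$$

where $D_{jt_i}$ is the definition taken from the $j$-th article retrieved for term $t_i$. Later, set $C$ will be an input to the correlation algorithm.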
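The definition step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a caller-supplied `search_articles` function that returns article texts for a term (a hypothetical stand-in for a Wikipedia lookup, not a real API), a tiny illustrative stopword set instead of the 571-word list, and that an article's first paragraph ends at the first blank line.

```python
from typing import Callable, Dict, List

# Illustrative subset only; the method described above uses a 571-word list.
STOPWORDS = {"a", "about", "the", "her"}


def query_terms(query: str) -> List[str]:
    """Lowercase the query, split on whitespace, and drop stopwords."""
    return [t for t in query.lower().split() if t not in STOPWORDS]


def first_paragraph(article: str) -> str:
    """Return the text before the first blank line (assumed paragraph break)."""
    return article.strip().split("\n\n", 1)[0]


def build_definitions(
    query: str,
    search_articles: Callable[[str], List[str]],
    top_n: int = 3,
) -> Dict[str, List[str]]:
    """Map each query term to the first paragraphs of its top_n articles."""
    return {
        term: [first_paragraph(a) for a in search_articles(term)[:top_n]]
        for term in query_terms(query)
    }


def demo_search(term: str) -> List[str]:
    """Hypothetical in-memory stand-in for a Wikipedia search."""
    articles = {
        "vase": ["A vase is an open container.\n\nHistory follows."],
        "ancient": ["Ancient history covers early civilizations.\n\nMore text."],
    }
    return articles.get(term, [])
```

Calling `build_definitions("the ancient vase", demo_search)` yields one list of definition paragraphs per remaining query term, which plays the role of set $C$ in the text.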


3.2 Identifying the Candidate Terms 


The aim of this step is to generate candidate terms to be added to the user's query. The core of this step is to extract keywords from the top k documents obtained from the first-round retrieval results. It is built on two assumptions. The first assumption is that the top k documents are the closest documents to the query [30] [31]. The second assumption is that the frequency of a term in a single document determines how important the term is to the query [32].

Figure 1: Semantic query expansion method (flowchart blocks: original query, Wikipedia, query terms definition with the descriptions of term 1 to term N, extracted keywords, correlation algorithm, expanded query).


To select the top k documents $D = \{d_1, d_2, \ldots, d_k\}$, the value of k needs to be determined. The best value for k is 10; therefore, the first 10 documents retrieved in the first round are the best range from which to extract the candidate terms [7]. Next, the keywords are extracted from this domain (the top k documents), where the degree of importance of a term in a document decides whether it is a keyword. This importance is estimated from the frequency of the term relative to the entire text of the document. Term Frequency-Inverse Document Frequency (TF-IDF) is a measure of how informative terms are in the documents. It is a basic measure that is used in almost all QE methods to calculate the weights of terms, i.e. the user's query terms or candidate terms [33], and it is given as:


$$\text{TF-IDF} = tf(t) \times \log \frac{N}{df(t)} \qquad (1)$$

where:

$tf(t)$: number of times term $t$ appears in document $d$,
$N$: number of documents $d$ in the collection,
$df(t)$: number of documents $d$ in the corpus containing $t$, with $df(t) \geq 1$.


Note that, as is usual, the 100 highest-weighted keywords are taken as the candidate terms for each original query. The input of this step is the top k documents, and the output is the weights of the 100 candidate terms.
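As a rough illustration of this step, the sketch below weights the terms of the top k documents with Equation (1) and keeps the highest-weighted ones. Several details are assumptions rather than part of the described method: whitespace tokenization, the natural logarithm, keeping the maximum weight when a term occurs in several top documents, and the requirement that the top documents belong to the collection (so that $df(t) \geq 1$).

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Naive whitespace tokenizer (assumption; no stemming or stopwords)."""
    return text.lower().split()


def candidate_terms(
    top_docs: List[str], collection: List[str], limit: int = 100
) -> Dict[str, float]:
    """Weight terms of top_docs by tf(t) * log(N / df(t)); keep the best `limit`."""
    n_docs = len(collection)
    df: Counter = Counter()
    for doc in collection:
        df.update(set(tokenize(doc)))  # count each term once per document

    weights: Dict[str, float] = {}
    for doc in top_docs:  # top_docs are assumed to be members of collection
        for term, tf in Counter(tokenize(doc)).items():
            weight = tf * math.log(n_docs / df[term])
            # Assumption: keep the largest weight seen across the top documents.
            weights[term] = max(weight, weights.get(term, 0.0))

    # Rank by descending weight, breaking ties alphabetically for determinism.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:limit])
```

With `limit=100` and the first 10 retrieved documents as `top_docs`, the returned dictionary corresponds to the weighted candidate set that is passed on to the correlation algorithm.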


3.3 Correlation Algorithm 


The definitions of the user's query terms obtained in the query terms definition step and the candidate terms produced by the candidate terms identification step are used as inputs to the proposed correlation algorithm (CA), which determines which of the candidate terms are important enough to be added to the user's query. The proposed algorithm depends on some simple, sensible Boolean heuristics.


Algorithm 1: Proposed Semantic Query Expansion (SQE) Method

Input: User query $Q = \{t_1, t_2, \ldots, t_n\}$, Wikipedia, top 10 documents $D = \{d_1, d_2, \ldots, d_{10}\}$
Step 1:
    For each $t_n \in Q$ do
        Send $t_n$ to Wikipedia  // send it as a query
        Return related articles
        Select the top 3 articles $C = \{T_1, T_2, T_3\}$
        Extract the first paragraph $C = \{(D_{1t_n}, D_{2t_n}, D_{3t_n})\}$  // read the first paragraph
    Return $C$
Step 2:
    For each $d \in D$ do
        Tokenize/split the text into terms $T = \{t_1, t_2, t_3, \ldots, t_m\}$
        For each $t_m \in T$ do
            Compute the weight of $t_m$ by using $\text{TF-IDF} = tf(t) \times \log \frac{N}{df(t)}$
    Rank the terms based on the weight $T = \{t_{w1}, t_{w2}, t_{w3}, \ldots, t_{wm}\}$
    Return $T = \{t_{w1}, t_{w2}, \ldots, t_{w100}\}$  // top 100 terms
Step 3:
    For each $t_m \in T$ do
        If $t_m \in \forall\{x : R\}$ then
            $Q_e = Q \cup t_m$
        Else if $t_m \notin \forall\{x : R\}$ then
            $Q_e = Q$
        Else if $t_m \in \exists\{x : R\}$ then
            For all $D_{t_n} \in C$ do
                $C = (D_{1t_1} \cup D_{2t_1} \cup D_{3t_1}) \cup (D_{1t_2} \cup D_{2t_2} \cup D_{3t_2}) \cup \ldots \cup (D_{1t_n} \cup D_{2t_n} \cup D_{3t_n})$
            Compute $SIM(t_m, C) = \frac{\sum_{m=1}^{t} (t_m \times D_m)}{\sqrt{\sum_{m=1}^{t} t_m^2} \times \sqrt{\sum_{m=1}^{t} D_m^2}}$
            Rank the $t_m$ based on the $SIM(t_m, C)$ value
    Return $Q_e$
Output: Expanded query $Q_e$

Let $Q$ be the set of query terms $t_i$, $C$ be the set of definitions $D$, $T$ be the set of the weights of the candidate terms, and $R$ be the relation describing which term belongs to which definition(s):

$$Q = \{t_1, t_2, \ldots, t_n\}$$

$$C = \{(D_{1t_1}, D_{2t_1}, D_{3t_1}), (D_{1t_2}, D_{2t_2}, D_{3t_2}), \ldots, (D_{1t_n}, D_{2t_n}, D_{3t_n})\}$$

$$T = \{t_{w1}, t_{w2}, t_{w3}, \ldots, t_{wm}\}$$

$$R = \{(t_1, D_{1t_1}, D_{2t_1}, D_{3t_1}), (t_2, D_{1t_2}, D_{2t_2}, D_{3t_2}), \ldots, (t_n, D_{1t_n}, D_{2t_n}, D_{3t_n})\}$$

It means that $D_{1t_1}, D_{2t_1}, D_{3t_1}$ define $t_1$, $D_{1t_2}, D_{2t_2}, D_{3t_2}$ define $t_2$, and so on.

The detailed steps of CA are as follows:

i. If the candidate term $t_m$ occurs in all query terms' definitions, $t_m \in \forall\{x : R\}$, where $x$ is a pair of definitions in $R$, then the query $Q$ is expanded with $t_m$ directly to $Q_e = \{t_1, t_2, \ldots, t_n, t_m\}$.

ii. If the candidate term $t_m$ does not occur in any of the query terms' definitions, $t_m \notin \forall\{x : R\}$, where $x$ is a pair of definitions in $R$, then $t_m$ is discarded.

iii. Otherwise, when $t_m \in \exists\{x : R\}$, where $x$ is a pair of definitions in $R$, all query term definitions in $C$ are merged, and $D$, the set of common definitions, is generated. The merging process starts inside each group of definitions in $C$: the union of the definitions corresponding to a specific term, $(D_{1t_n} \cup D_{2t_n} \cup D_{3t_n})$, is taken in order to get rid of duplicate terms. Then the union of these per-term unions over the whole of set $C$ is taken as follows:

$$C = (D_{1t_1} \cup D_{2t_1} \cup D_{3t_1}) \cup (D_{1t_2} \cup D_{2t_2} \cup D_{3t_2}) \cup \ldots \cup (D_{1t_n} \cup D_{2t_n} \cup D_{3t_n})$$

This merged set is the common definition set $D$.


Then, the similarity $SIM(t_m, D)$ between $t_m$ and the common definition set $D$ is calculated using cosine similarity, a measure that quantifies the cosine of the angle between two non-zero vectors in an inner product space. It is especially useful in positive space, where the result is neatly bounded in $[0, 1]$. It is used in this work because it is reported in [34] to be the best measure compared with other similarity measures, and because it works well together with TF-IDF. Each candidate term is handled as a single query, and the set $D$ that contains the terms' definitions is treated as a document. Cosine similarity therefore measures how close each candidate term is to $D$, and the candidate term with the highest cosine value (the smallest angle) is considered the closest to the document. The cosine equation is based on two coefficients: the weights of the candidate terms and the weights of the definition terms. The first coefficient has already been computed as described in Section 3.2, whereas the second has not. So, before the cosine equation is used, the weights of the definition terms must be calculated, in the same way as the weights of the candidate terms. The cosine similarity is expressed mathematically as:

$$SIM(t_m, D_m) = \frac{\sum_{m=1}^{t} (t_m \times D_m)}{\sqrt{\sum_{m=1}^{t} t_m^2} \times \sqrt{\sum_{m=1}^{t} D_m^2}} \qquad (2)$$

where:

$t$: length of the definition set $D$,
$D_m$: weight of the term $m$ in the definition set $D$,
$t_m$: weight of the candidate term $m$.


Then, the candidate terms are ranked by their similarity; the closest terms are those with the highest similarity value, i.e. those with the smallest angle and shortest distance.

iv. The last step in the SQE method is to add the closest candidate terms to the original query, and then automatically send the expanded query to the collection to retrieve the second-round results.
4. EXPERIMENTS, RESULTS AND DISCUSSION


Several experiments were conducted to illustrate the efficacy of the proposed SQE approach. The main goal of the assessment is to demonstrate that the proposed SQE approach can improve the efficiency of DRO retrieval. Typical mechanisms were created and implemented to handle the CHiC2013 and ECHiC2013_EDE collections. For evaluation purposes, the DS model is used as the retrieval model to retrieve the CHiC2013 documents. The top 10 retrieved documents are obtained after query expansion. Figure 2 describes the experimental methodology used to assess the performance of the proposed SQE method against the existing methods. The first branch (IRS-1) is a normal IR system without QE, and the second branch (IRS-2) is an IR system that applies QE; both are considered benchmarks. The third (IRS-3) and fourth (IRS-4) branches involve the proposed SQE method and use, respectively, the CHiC2013 English collection and its extended version, the ECHiC2013_EDE collection. The details of the benchmarks are discussed in the next section.


4.1 Experiment Setup 


As test collections, this study used the CHiC2013 English collection and its expanded version, the ECHiC2013_EDE collection. The collection includes 1107 documents and 22 evaluation queries (on average, 1.6 terms per query). Relevance judgments give each document's relevance to each evaluation query. For all of the tests, the effectiveness of the proposed method is reported using three standard measures: Mean Average Precision (MAP), precision at the top 10 documents (P@10) and the Precision-Recall curve [35]. A two-tailed paired t-test is used to test whether the differences between the performances of the methods are statistically significant. The English Wikipedia was used as the external resource to generate the related articles. Two IR system benchmarks were used. In both IR systems, the CHiC2013 collection is retrieved by the language model as described in [36]. The first IR system (IRS-1) does not use the QE method, and therefore only the initial results (the first-round results) are subject to evaluation, while the second IR system (IRS-2) employs the QE method proposed by [37] and its second-round results are subject to evaluation. The third IR system (IRS-3) employs the proposed SQE method. Furthermore, for the IR system handling ECHiC2013_EDE (IRS-4), documents were expanded using the SQE method. The experimental results for the proposed SQE method are presented in detail in the next subsections. Table 1 presents some statistics of the test queries.

Figure 2: Methodology of the experiment (the labels refer to the branch number). The diagram shows the construction of the inverted file followed by four branches, each using the DS model and document ranking: IRS-1 and IRS-2 (benchmarks, evaluated on CHiC2013), IRS-3 (evaluated on CHiC2013) and IRS-4 (evaluated on ECHiC2013_EDE).


Table 1: Statistics of the test queries.

Parameter Name                    Value
Number of testing queries         22
Number of Wikipedia articles      3
Average number of query terms     1.6
Number of candidate terms         100
Number of expanded terms          10
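For reference, the two scalar measures named in the setup, P@10 and MAP, can be computed as in the minimal sketch below. It uses the standard textbook definitions; the document identifiers and the layout of the relevance judgments (a set of relevant identifiers per query) are assumptions for illustration, not the format of the CHiC2013 judgments.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the rank of each relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of the per-query average precision over all queries."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Each element of `runs` pairs the ranked result list of one query with its set of relevant documents, so the 22 evaluation queries would give a list of 22 pairs.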


4.2 Experiment Results 


The experiment ranks the text documents for 22 short queries. Table 2 displays the outcomes for IRS-1, IRS-2, IRS-3, and IRS-4. According to this table, the MAP and P@10 of the proposed SQE approach show a substantial improvement in retrieval performance over the MAP and P@10 of the benchmarks. In addition, Figure 3 shows the Precision-Recall curve for IRS-1, IRS-2, IRS-3 and IRS-4. Based on this figure, the precision of IRS-4 is higher than that of IRS-3, IRS-2 and IRS-1 at different recall points, which shows that the SQE method helps to improve the performance of DRO retrieval in both collections, CHiC2013 and ECHiC2013_EDE. The IR system for DROs performs markedly better when the QE method is employed, especially when semantic terms are combined with the QE method as in the SQE method. It is also worth highlighting that the retrieval results on ECHiC2013_EDE are better than the retrieval results on CHiC2013. The proposed SQE method depends on the top 10 results to determine the candidate terms, so better first-round results make the candidate terms more effective and closer to the user's query. Overall, the results show that the proposed SQE approach is more suitable for DRO collections and better than the QE method.


Journal of Theoretical and Applied Information Technology
31st March 2024. Vol. 102. No 6
© Little Lion Scientific
ISSN: 1992-8645    E-ISSN: 1817-3195


Table 2: MAP and P@10 for IRS-1, IRS-2, IRS-3, and IRS-4.

Information Retrieval System      MAP        P@10
IRS-1                             0.502      0.419
IRS-2                             0.540      0.488
IRS-3                             0.590      0.531
IRS-4                             0.741      0.656
Improvement (IRS-2, IRS-1)        3.800%     6.900%
Improvement (IRS-3, IRS-1)        8.800%     11.200%
Improvement (IRS-3, IRS-2)        5.000%     4.300%
Improvement (IRS-4, IRS-1)        23.900%    23.700%
Improvement (IRS-4, IRS-2)        20.100%    16.800%
Improvement (IRS-4, IRS-3)        15.100%    12.500%
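The improvement rows in Table 2 are consistent with absolute differences between the two systems' scores, expressed in percentage points, rather than relative gains. The short check below reproduces them from the MAP and P@10 values in the table; the function and variable names are illustrative only.

```python
from typing import Dict, Tuple

# (MAP, P@10) values copied from Table 2.
SCORES: Dict[str, Tuple[float, float]] = {
    "IRS-1": (0.502, 0.419),
    "IRS-2": (0.540, 0.488),
    "IRS-3": (0.590, 0.531),
    "IRS-4": (0.741, 0.656),
}


def improvement(system: str, baseline: str) -> Tuple[float, ...]:
    """Absolute gain of `system` over `baseline`, in percentage points."""
    return tuple(
        round((s - b) * 100, 1)
        for s, b in zip(SCORES[system], SCORES[baseline])
    )
```

For example, `improvement("IRS-3", "IRS-1")` gives 8.8 for MAP and 11.2 for P@10, matching the corresponding row of the table.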
Figure 3: Comparison of IRS-1, IRS-2, IRS-3 and IRS-4 using the averaged 9-point Precision-Recall curve.


4.3 Statistical Significance Analysis 


The outcomes of the t-test for all pairs of approaches are shown in Table 3. It is worth noting that in all cases the p-value is less than 0.05 with respect to the MAP measure, which implies that the retrieval performance of IRS-3 and IRS-4 is significantly superior to that of the corresponding baselines (IRS-1 and IRS-2).


As a result, because all of these p-values are less than 0.05 (a 95% confidence level), the results indicate that IRS-3 and IRS-4 (with the proposed SQE method) statistically outperform IRS-1 (without QE) and IRS-2 (with QE) on MAP. For the P@10 measure, however, it is clear from the table that not all cases have a p-value less than 0.05. For example, the comparison of IRS-3 vs IRS-2 has a p-value of about 0.134, and the comparison of IRS-4 vs IRS-3 has a p-value of about 0.143.

Table 3: Paired t-test analysis on both measures, MAP and P@10, for IRS-1, IRS-2, IRS-3, and IRS-4.

Information System     MAP                       P@10
                       t-value    p-value        t-value    p-value
IRS-2 vs IRS-1         3.953      7.263E-04      2.300      2.936E-02
IRS-3 vs IRS-1         9.985      1.989E-09      4.431      1.401E-04
IRS-3 vs IRS-2         9.442      3.365E-10      1.547      1.344E-01
IRS-4 vs IRS-1         12.280     8.624E-13      5.337      1.104E-05
IRS-4 vs IRS-2         11.623     7.309E-11      2.479      1.971E-02
IRS-4 vs IRS-3         5.347      2.653E-05      1.506      1.432E-01


Hence, with respect to P@10, there is no significant difference between the performance of IRS-3 and IRS-2, or between the performance of IRS-4 and IRS-3; in each pair the two systems exhibit similar performance at the top 10 documents. In all other cases, the p-values are below the significance level (p < 0.05), indicating that the means of the compared methods differ and that the improvement in both MAP and P@10 is unlikely to be due to chance. For example, the p-value of 3.365E-10 for IRS-3 vs IRS-2 on MAP means that a difference this large would be expected with a probability of only about 3.4 in 10 billion if the two systems actually performed equally. Because the p-values for the proposed approaches are low in these cases, the results indicate that the enhancements did not occur by coincidence and are meaningful.
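The t-values in Table 3 come from a paired t-test over per-query scores. As a minimal sketch of that statistic, assuming two equal-length lists of per-query scores for the systems being compared (the per-query values themselves are not given in the paper), the standard formula is the mean of the paired differences divided by their standard error. With 22 queries the test has 21 degrees of freedom.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic: mean difference over its standard error."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    # stdev is the sample standard deviation (n - 1 in the denominator);
    # it is zero, and the statistic undefined, if all differences are equal.
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
```

The two-tailed p-value then follows from Student's t distribution with n - 1 degrees of freedom; a statistics library such as SciPy (`scipy.stats.ttest_rel`) can return both the statistic and the p-value directly.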


5. CONCLUSION 


In this study, an SQE approach for improving DRO retrieval performance is proposed. The SQE approach aims to overcome the short user query problem, which has a detrimental impact on retrieval performance. The proposed SQE method enhances the quality of the candidate terms to be added semantically to the query terms, and the resulting semantic terms are obtained through the proposed correlation algorithm, which is based on some simple and effective Boolean heuristics, with Wikipedia as an external resource. The SQE approach can improve the performance of DRO retrieval. Two IR benchmarks were used. The first IR system does not use the QE approach, so only its initial results (the first-round results) are evaluated, whereas the second IR system uses the usual QE technique and its second-round results are evaluated.


The P@10, MAP, and Precision-Recall curves were employed for evaluation, with the CHiC2013 and ECHiC2013_EDE collections as datasets. The results suggest that the proposed technique improves DRO retrieval performance compared with the benchmarks.